What Makes a Potent Nitrosamine? Statistical Validation of Expert-Derived Structure–Activity Relationships

The discovery of carcinogenic nitrosamine impurities above the safe limits in pharmaceuticals has led to an urgent need to develop methods for extending structure–activity relationship (SAR) analyses from relatively limited datasets, while the level of confidence required in that SAR indicates that there is significant value in investigating the effect of individual substructural features in a statistically robust manner. This is a challenging exercise to perform on a small dataset, since in practice, compounds contain a mixture of different features, which may confound both expert SAR and statistical quantitative structure–activity relationship (QSAR) methods. Isolating the effects of a single structural feature is made difficult due to the confounding effects of other functionality as well as issues relating to determining statistical significance in cases of concurrent statistical tests of a large number of potential variables with a small dataset; a naïve QSAR model does not predict any features to be significant after correction for multiple testing. We propose a variation on Bayesian multiple linear regression to estimate the effects of each feature simultaneously yet independently, taking into account the combinations of features present in the dataset and reducing the impact of multiple testing, showing that some features have a statistically significant impact. This method can be used to provide statistically robust validation of expert SAR approaches to the differences in potency between different structural groupings of nitrosamines. Structural features that lead to the highest and lowest carcinogenic potency can be isolated using this method, and novel nitrosamine compounds can be assigned into potency categories with high accuracy.


Naïve feature identification
Tables 3 and 4 in the main manuscript show the most significant features in the nitrosamine dataset using a t-test to test for a difference in mean potency where TD50s are available and Fisher exact test to identify differences in overall carcinogenic call. The full results are given below.

Regression (potency) significance
Difference between compounds with and without each feature using a 2 sample t-test -assessed without the predictive model described. 17 features had sufficient data to run the test giving a Bonferroni corrected p-value threshold of 0.0029.

Prior sensitivity
Being a Bayesian model the method presented in the main text is reliant on prior estimates of the effect of a given feature.
When predicting the effects of potency, the confidence of a feature having an effect on the potency is consistent for all features across a wide range of priors. Similarly, the magnitude of most features is consistent across a wide range with some exceptions. For ring features piperazine (both N4 substituted and unsubstituted) appears dependent on the prior. Both these features have very little compound data available, with 2 examples of a substituted piperazine and only a single example of an unsubstituted. For substitution features both C-aromatic and isopropyl features are consistently predicted to reduce potency but the magnitude of the effect is dependent on the prior, likewise the Benzylic feature is consistently found to increase potency, but the magnitude varies with the prior (see figure S1). As with the ring features each of these has little supporting evidence forcing the model to rely more on the prior than with other features. However it should be noted that the values shown in figure S1 represent maximum likelihood estimates only and the estimates across the full range of priors tested remain within the plausible regions estimated by the model.
The classification model is substantially more sensitive to the choice of prior with most of the ring features being highly dependent on the prior (see figure S2), although like the regression model the confidence in an effect is relatively stable compared to the magnitude. In this case we have erred towards a more conservative prior which attributes minimal changes to the features. The relatively high reliance is likely due to the lack of information contained within a positive/negative call. A TD50 gives information not only on whether a compound is more potent than another but also how much so. The regression model is able to use this information when constructing estimates of the feature effect size. In contrast a positive call does not give any information on how much more likely that call was than the base rate making estimations of effect size more difficult.
S5 Figure S1: Maximum likelihood estimates of feature effect sizes and model confidence across a range of priors for the regression model. Figure S2: Maximum likelihood estimates of feature effect sizes and model confidence across a range of priors for the classification model.

S6
Leave one out cross validation scoring 1 (LOO) was used to assess model goodness of fit across the range of priors tested. All but the most restrictive prior performed equally well for the classification task. For the regression problem tighter priors gave consistently better LOO scores, this is likely due to the fact that the TD50 distribution can be very well described by a single log-normal distribution.
Tighter priors act to force the observed variation into the base distribution representing the featureless nitrosamine, as the base distribution is a very good fit this can be done without significantly decreasing the likelihood of the observations, LOO scoring applies a penalty to account for the effective extra degrees of freedom given by looser priors which then dominates the improvement in likelihood at larger k. S7 attributed to variation in the featureless nitrosamine. While this may be a mathematically valid description there are very strong reasons to expect the features of a nitrosamine to influence its potency, in this situation deciding priors based on the LOO alone is not likely to produce useful results and a prior was chosen to reflect expert assessment (k = 1, giving a 50% prior confidence of feature effects within a 4.93 fold change, and a 95% confidence of within 990 fold). However, as the regression model is relatively insensitive to the choice of prior other choices would give similar results. Figure S3: Goodness of fit given by model LOO scores.

N-nitroso compounds (NOC) dataset
The results presented in the main text focus specifically on dialkyl nitrosamines with other N-nitroso compounds (NOCs) being excluded from the dataset, however the same analysis has been performed on a larger set of N-nitroso compounds with the results shown here.

Naïve Feature Significance
The NOC dataset contains positive/negative calls for 231 compounds and TD50s for 112 compounds.
Like the nitrosamine dataset only the carboxylic acid feature is significant using Fishers exact test with the Bonferroni correction, however in the regression dataset the new feature Nitrosohydroxylamine is significantly less potent than the dataset as a whole (see tables S3 and S4).

S12
Feature dependence Figure

S15
Comparison with expert predictions Figure S7: Comparison between expert predictions and model output for all N-nitroso compounds. Expert predictions were for nitrosamine compounds only and may not generalise to the larger chemical space, however good agreement is still found. This recreates figure 6 in the manuscript for the NOC dataset.

S16
Use as prediction model